{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "37tFN3XcxyPc" }, "source": [ "# **Machine Learning Class - Trainee GVCode**" ] }, { "cell_type": "markdown", "metadata": { "id": "dP-sIhXAxyPh" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "D3NPgIuHxyPi" }, "source": [ ">Machine Learning is making the computer learn from studying data and statistics.\n", ">\n", ">Machine Learning is a step into the direction of artificial intelligence (AI).\n", ">\n", ">Machine Learning is a program that analyses data and learns to predict the outcome." ] }, { "cell_type": "markdown", "metadata": { "id": "nMekDh3rxyPj" }, "source": [ "This guide is divided into 2 main parts, both essential for building a functional, usable model:\n", "\n", "1. Data preprocessing and exploratory analysis\n", "\n", " 1.1. Loading the data\\\n", " 1.2. Cleaning the data\\\n", " 1.3. Exploratory analysis\n", "\n", "2. Modeling\n", "\n", " 2.1. Building the model\\\n", " 2.2. Results\\\n", " 2.3. Validating the model" ] }, { "cell_type": "markdown", "metadata": { "id": "I0HfReNaxyPk" }, "source": [ "## **Part 1 - Data preprocessing and exploratory analysis** - Supervised problem (the machine learns from labeled data)" ] }, { "cell_type": "markdown", "metadata": { "id": "DLRqyw231YZN" }, "source": [ "### Preprocessing" ] }, { "cell_type": "markdown", "metadata": { "id": "dY86oMXuxyPk" }, "source": [ "\n", "\n", "Preprocessing is a crucial step that should always be done at the start of a Data Science project (unless someone has already done it for you). It includes:\n", "* Handling NULL values, that is, missing values\n", "* Removing columns and rows with irrelevant information\n", "* Detecting outliers\n", "* Cleaning the data in general." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "KYt19bBqxyPl" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset is available on [Kaggle](https://www.kaggle.com/datasets/sujay1844/used-car-prices/data) and in the [trainee GitHub repository](../data/car_old.csv)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "Fm7sA20uxyPo", "outputId": "29d1bf4b-2cf6-4113-9f29-e08e159808ee" }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Unnamed: 0 Name Location Year \\\n", "0 0 Maruti Wagon R LXI CNG Mumbai 2010 \n", "1 1 Hyundai Creta 1.6 CRDi SX Option Pune 2015 \n", "2 2 Honda Jazz V Chennai 2011 \n", "3 3 Maruti Ertiga VDI Chennai 2012 \n", "4 4 Audi A4 New 2.0 TDI Multitronic Coimbatore 2013 \n", "\n", " Kilometers_Driven Fuel_Type Transmission Owner_Type Mileage Engine \\\n", "0 72000 CNG Manual First 26.6 km/kg 998 CC \n", "1 41000 Diesel Manual First 19.67 kmpl 1582 CC \n", "2 46000 Petrol Manual First 18.2 kmpl 1199 CC \n", "3 87000 Diesel Manual First 20.77 kmpl 1248 CC \n", "4 40670 Diesel Automatic Second 15.2 kmpl 1968 CC \n", "\n", " Power Seats New_Price Price \n", "0 58.16 bhp 5.0 NaN 1.75 \n", "1 126.2 bhp 5.0 NaN 12.50 \n", "2 88.7 bhp 5.0 8.61 Lakh 4.50 \n", "3 88.76 bhp 7.0 NaN 6.00 \n", "4 140.8 bhp 5.0 NaN 17.74 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# importing dataset\n", "car_train = pd.read_csv('car_train.csv')\n", "\n", "# printing the first 5 rows of dataset\n", "car_train.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "tAc6ogO_xyPs", "outputId": "d08f26fc-a5ea-405f-8e93-46dea6acc758" }, "outputs": [ { "data": { "text/plain": [ "['Unnamed: 0',\n", " 'Name',\n", " 'Location',\n", " 'Year',\n", " 'Kilometers_Driven',\n", " 'Fuel_Type',\n", " 'Transmission',\n", " 'Owner_Type',\n", " 'Mileage',\n", " 'Engine',\n", " 'Power',\n", " 'Seats',\n", " 'New_Price',\n", " 'Price']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# getting the columns of the dataset\n", "columns = list(car_train.columns)\n", "columns" ] }, { "cell_type": "markdown", "metadata": { "id": "IusdUm-YxyPu" }, "source": [ "Identify and write down the potential problems you will have to deal with in the DataFrame:\n", "\n", "* Some columns have missing values (NaN). This can cause many problems for analysis and modeling if it is not resolved at the start of the process.\n", "\n",
 "* Some columns mix text and numbers, such as Mileage, Engine and Power. This is a problem because these columns cannot be plotted or used in numeric computations (for example, to explore their relationship with Price) until the units are stripped.\n", "\n", "* Several columns hold strings and must be encoded as numbers (for example, as dummy variables) so that the model can be trained." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "ku4cm4PixyPv", "outputId": "7160d037-627a-48d0-e214-d6da0d1b42c9" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Missing values distribution: \n", "Unnamed: 0 0\n", "Name 0\n", "Location 0\n", "Year 0\n", "Kilometers_Driven 0\n", "Fuel_Type 0\n", "Transmission 0\n", "Owner_Type 0\n", "Mileage 2\n", "Engine 36\n", "Power 36\n", "Seats 42\n", "New_Price 5195\n", "Price 0\n", "dtype: int64\n", "\n" ] } ], "source": [ "# examining missing values - number of missing values in each column\n", "print(\"Missing values distribution: \")\n", "print(car_train.isna().sum())\n", "print(\"\")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "5HrnrA96xyPw", "outputId": "7d439fb7-d0d2-4ce4-dc54-5df6055b7065" }, "outputs": [ { "data": { "text/plain": [ "(6019, 14)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The size of our DataFrame - rows and columns\n", "\n", "car_train.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "bvv-7I3jxyPy" }, "source": [ "How should we handle null values in the columns?\n", "\n", "There are a few ways to deal with this problem:\n", "\n", "1. Drop the entire column. If the column is not very important or has very little data, simply remove it.\n", "\n", "2. Keep the column if it is important, and drop only the rows where its value is missing.\n", "\n",
 "3. Replace the null values with other values in a way that does not distort the analysis (e.g. forward fill, backward fill, mean, etc.).\n", "\n", "Forward fill -> the last known value is carried forward to fill the gaps that follow it\n", "\n", "Backward fill -> the next known value is used to fill the gaps that precede it\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "yYpAQVYNxyPz" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name Location Year Kilometers_Driven \\\n", "0 Maruti Wagon R LXI CNG Mumbai 2010 72000 \n", "1 Hyundai Creta 1.6 CRDi SX Option Pune 2015 41000 \n", "2 Honda Jazz V Chennai 2011 46000 \n", "3 Maruti Ertiga VDI Chennai 2012 87000 \n", "4 Audi A4 New 2.0 TDI Multitronic Coimbatore 2013 40670 \n", "... ... ... ... ... \n", "6014 Maruti Swift VDI Delhi 2014 27365 \n", "6015 Hyundai Xcent 1.1 CRDi S Jaipur 2015 100000 \n", "6016 Mahindra Xylo D4 BSIV Jaipur 2012 55000 \n", "6017 Maruti Wagon R VXI Kolkata 2013 46000 \n", "6018 Chevrolet Beat Diesel Hyderabad 2011 47000 \n", "\n", " Fuel_Type Transmission Owner_Type Mileage Engine Power Seats \\\n", "0 CNG Manual First 26.6 km/kg 998 CC 58.16 bhp 5.0 \n", "1 Diesel Manual First 19.67 kmpl 1582 CC 126.2 bhp 5.0 \n", "2 Petrol Manual First 18.2 kmpl 1199 CC 88.7 bhp 5.0 \n", "3 Diesel Manual First 20.77 kmpl 1248 CC 88.76 bhp 7.0 \n", "4 Diesel Automatic Second 15.2 kmpl 1968 CC 140.8 bhp 5.0 \n", "... ... ... ... ... ... ... ... \n", "6014 Diesel Manual First 28.4 kmpl 1248 CC 74 bhp 5.0 \n", "6015 Diesel Manual First 24.4 kmpl 1120 CC 71 bhp 5.0 \n", "6016 Diesel Manual Second 14.0 kmpl 2498 CC 112 bhp 8.0 \n", "6017 Petrol Manual First 18.9 kmpl 998 CC 67.1 bhp 5.0 \n", "6018 Diesel Manual First 25.44 kmpl 936 CC 57.6 bhp 5.0 \n", "\n", " Price \n", "0 1.75 \n", "1 12.50 \n", "2 4.50 \n", "3 6.00 \n", "4 17.74 \n", "... ... 
\n", "6014 4.75 \n", "6015 4.00 \n", "6016 2.90 \n", "6017 2.65 \n", "6018 2.50 \n", "\n", "[5975 rows x 12 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop the columns we do not need: New_Price (mostly missing) and Unnamed: 0 (a leftover index)\n", "car_train.drop(['New_Price', 'Unnamed: 0'], inplace=True, axis=1) # inplace=True modifies the DataFrame in place instead of returning a new one / axis=1 means we are dropping columns\n", "car_train.dropna(inplace=True, axis=0) # remove every row that contains at least one NaN value\n", "car_train" ] }, { "cell_type": "markdown", "metadata": { "id": "eawU-0vjxyP0" }, "source": [ "Now we need to deal with the columns that mix numbers and strings, which would get in the way of modeling. First, let's check the data type of each column:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5dSqdFRAxyP1", "outputId": "8cdc352a-a179-43c5-9e2d-9ffed23f1b08" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Column datatypes: \n", "Name object\n", "Location object\n", "Year int64\n", "Kilometers_Driven int64\n", "Fuel_Type object\n", "Transmission object\n", "Owner_Type object\n", "Mileage object\n", "Engine object\n", "Power object\n", "Seats float64\n", "Price float64\n", "dtype: object\n" ] } ], "source": [ "# check datatype in each column\n", "print(\"Column datatypes: \")\n", "print(car_train.dtypes)" ] }, { "cell_type": "markdown", "metadata": { "id": "JmlbrnzpxyP2" }, "source": [ "As we can see, the Mileage, Engine and Power columns, which should be int or float, are stored as strings. To fix this:\n", "\n", "1. Strip the units of measurement from every observation\n", "2. Convert the column to a numeric type (int or float)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "GGVXdMKBxyP2", "outputId": "04af7825-4ca0-4fd4-a36b-e0542d387597" }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Name Location Year Kilometers_Driven \\\n", "0 Maruti Wagon R LXI CNG Mumbai 2010 72000 \n", "1 Hyundai Creta 1.6 CRDi SX Option Pune 2015 41000 \n", "2 Honda Jazz V Chennai 2011 46000 \n", "3 Maruti Ertiga VDI Chennai 2012 87000 \n", "4 Audi A4 New 2.0 TDI Multitronic Coimbatore 2013 40670 \n", "... ... ... ... ... \n", "6014 Maruti Swift VDI Delhi 2014 27365 \n", "6015 Hyundai Xcent 1.1 CRDi S Jaipur 2015 100000 \n", "6016 Mahindra Xylo D4 BSIV Jaipur 2012 55000 \n", "6017 Maruti Wagon R VXI Kolkata 2013 46000 \n", "6018 Chevrolet Beat Diesel Hyderabad 2011 47000 \n", "\n", " Fuel_Type Transmission Owner_Type Mileage Engine Power Seats Price \n", "0 CNG Manual First 26.60 998.0 58.16 5.0 1.75 \n", "1 Diesel Manual First 19.67 1582.0 126.20 5.0 12.50 \n", "2 Petrol Manual First 18.20 1199.0 88.70 5.0 4.50 \n", "3 Diesel Manual First 20.77 1248.0 88.76 7.0 6.00 \n", "4 Diesel Automatic Second 15.20 1968.0 140.80 5.0 17.74 \n", "... ... ... ... ... ... ... ... ... \n", "6014 Diesel Manual First 28.40 1248.0 74.00 5.0 4.75 \n", "6015 Diesel Manual First 24.40 1120.0 71.00 5.0 4.00 \n", "6016 Diesel Manual Second 14.00 2498.0 112.00 8.0 2.90 \n", "6017 Petrol Manual First 18.90 998.0 67.10 5.0 2.65 \n", "6018 Diesel Manual First 25.44 936.0 57.60 5.0 2.50 \n", "\n", "[5872 rows x 12 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# strip the units and convert to float\n", "car_train['Mileage'] = car_train['Mileage'].str.replace('kmpl', '').str.replace('km/kg', '')\n", "car_train['Engine'] = car_train['Engine'].str.replace('CC', '')\n", "car_train['Power'] = car_train['Power'].str.replace('bhp', '')\n", "\n", "col = ['Mileage','Engine','Power'] # names of the columns that will be converted to numbers\n", "\n", "car_train[col] = car_train[col].apply(pd.to_numeric, errors='coerce') # errors=\"coerce\": values that cannot be parsed become NaN\n", "\n", 
"# errors='raise' (default) = an exception is raised and processing stops\n", "# errors='ignore' = invalid input is returned unchanged (deprecated in recent pandas versions)\n", "# errors takes a single string, not a list; to treat columns differently, call pd.to_numeric on each column separately\n", "\n", "car_train.dropna(inplace=True, axis=0) # remove the rows with NaN again\n", "\n", "car_train" ] }, { "cell_type": "markdown", "metadata": { "id": "z8X0OKKWxyP3" }, "source": [ "Now that we have fixed the columns that mixed strings and numbers, we need to encode the columns with categorical data as numbers (Location, Fuel_Type, Transmission; Name is left as text and dropped before modeling):" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "id": "wj9DskJOxyP3" }, "outputs": [], "source": [ "from sklearn import preprocessing" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "id": "xl8sreurxyP3", "outputId": "f68388a0-5b7c-4fcc-ffcf-6b7d94936d7e" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name Location Year Kilometers_Driven \\\n", "0 Maruti Wagon R LXI CNG 9 2010 72000 \n", "1 Hyundai Creta 1.6 CRDi SX Option 10 2015 41000 \n", "2 Honda Jazz V 2 2011 46000 \n", "3 Maruti Ertiga VDI 2 2012 87000 \n", "4 Audi A4 New 2.0 TDI Multitronic 3 2013 40670 \n", "... ... ... ... ... \n", "6014 Maruti Swift VDI 4 2014 27365 \n", "6015 Hyundai Xcent 1.1 CRDi S 6 2015 100000 \n", "6016 Mahindra Xylo D4 BSIV 6 2012 55000 \n", "6017 Maruti Wagon R VXI 8 2013 46000 \n", "6018 Chevrolet Beat Diesel 5 2011 47000 \n", "\n", " Fuel_Type Transmission Owner_Type Mileage Engine Power Seats \\\n", "0 0 1 1 26.60 998.0 58.16 5.0 \n", "1 1 1 1 19.67 1582.0 126.20 5.0 \n", "2 3 1 1 18.20 1199.0 88.70 5.0 \n", "3 1 1 1 20.77 1248.0 88.76 7.0 \n", "4 1 0 2 15.20 1968.0 140.80 5.0 \n", "... ... ... ... ... ... ... ... \n", "6014 1 1 1 28.40 1248.0 74.00 5.0 \n", "6015 1 1 1 24.40 1120.0 71.00 5.0 \n", "6016 1 1 2 14.00 2498.0 112.00 8.0 \n", "6017 3 1 1 18.90 998.0 67.10 5.0 \n", "6018 1 1 1 25.44 936.0 57.60 5.0 \n", "\n", " Price \n", "0 1.75 \n", "1 12.50 \n", "2 4.50 \n", "3 6.00 \n", "4 17.74 \n", "... ... 
\n", "6014 4.75 \n", "6015 4.00 \n", "6016 2.90 \n", "6017 2.65 \n", "6018 2.50 \n", "\n", "[5872 rows x 12 columns]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols = ['Location', 'Fuel_Type', 'Transmission'] # list of the columns we want to encode\n", "\n", "car_train[cols] = car_train[cols].apply(preprocessing.LabelEncoder().fit_transform) # LabelEncoder -> tool that turns categorical labels into integer codes\n", "\n", "car_train" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "id": "pr6BiBY6xyP4", "outputId": "f5efe85c-78d7-404a-9654-bbc22322861f" }, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 4, 3], dtype=int64)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "car_train.Owner_Type.unique() # ordinal values (ordered categories, not a metric scale)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "id": "4SNqTJa9xyP5" }, "outputs": [], "source": [ "owner = {'First':1,'Second':2, 'Third':3,'Fourth & Above':4}\n", "\n", "car_train.Owner_Type = car_train.Owner_Type.replace(owner)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "id": "Pfriz8kuxyP5", "outputId": "04d8f544-999b-4f4c-d5ca-59372cd08ccf" }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Name Location Year Kilometers_Driven \\\n", "0 Maruti Wagon R LXI CNG 9 2010 72000 \n", "1 Hyundai Creta 1.6 CRDi SX Option 10 2015 41000 \n", "2 Honda Jazz V 2 2011 46000 \n", "3 Maruti Ertiga VDI 2 2012 87000 \n", "4 Audi A4 New 2.0 TDI Multitronic 3 2013 40670 \n", "... ... ... ... ... \n", "6014 Maruti Swift VDI 4 2014 27365 \n", "6015 Hyundai Xcent 1.1 CRDi S 6 2015 100000 \n", "6016 Mahindra Xylo D4 BSIV 6 2012 55000 \n", "6017 Maruti Wagon R VXI 8 2013 46000 \n", "6018 Chevrolet Beat Diesel 5 2011 47000 \n", "\n", " Fuel_Type Transmission Owner_Type Mileage Engine Power Seats \\\n", "0 0 1 1 26.60 998.0 58.16 5.0 \n", "1 1 1 1 19.67 1582.0 126.20 5.0 \n", "2 3 1 1 18.20 1199.0 88.70 5.0 \n", "3 1 1 1 20.77 1248.0 88.76 7.0 \n", "4 1 0 2 15.20 1968.0 140.80 5.0 \n", "... ... ... ... ... ... ... ... \n", "6014 1 1 1 28.40 1248.0 74.00 5.0 \n", "6015 1 1 1 24.40 1120.0 71.00 5.0 \n", "6016 1 1 2 14.00 2498.0 112.00 8.0 \n", "6017 3 1 1 18.90 998.0 67.10 5.0 \n", "6018 1 1 1 25.44 936.0 57.60 5.0 \n", "\n", " Price \n", "0 1.75 \n", "1 12.50 \n", "2 4.50 \n", "3 6.00 \n", "4 17.74 \n", "... ... \n", "6014 4.75 \n", "6015 4.00 \n", "6016 2.90 \n", "6017 2.65 \n", "6018 2.50 \n", "\n", "[5872 rows x 12 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "car_train" ] }, { "cell_type": "markdown", "metadata": { "id": "LT5TZyTxyYi0" }, "source": [ "### Exploratory analysis" ] }, { "cell_type": "markdown", "metadata": { "id": "mGXH4gKSxyP5" }, "source": [ "With our dataset fully cleaned, let's analyze the correlation between the variables we will feed into the model, so we can pick the ones that improve its accuracy.\n", "\n", "\n", "\n", "There are different tests for different situations, depending on the type of the variables." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": { "id": "vUbDvgeBxyP6", "outputId": "76653453-2172-4958-c01b-35bf6aa9ed67" }, "outputs": [ { "data": { "text/plain": [ "Location -0.118238\n", "Year 0.299475\n", "Kilometers_Driven -0.008249\n", "Fuel_Type -0.301626\n", "Transmission -0.585623\n", "Owner_Type -0.091098\n", "Mileage -0.341652\n", "Engine 0.658047\n", "Power 0.772843\n", "Seats 0.055547\n", "Price 1.000000\n", "Name: Price, dtype: float64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Correlation analysis\n", "car_train.corr(method='pearson', numeric_only=True)['Price'] # Correlation of each variable with the price: the closer the absolute value is to 1, the stronger the linear correlation (numeric_only=True skips the text column Name)\n", "# Note on the result for Transmission: Pearson correlation between a label-encoded categorical variable and a continuous one is not meaningful\n", "# Negative correlation - the variables move in opposite directions (e.g. more previous owners, lower price)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "id": "CcjgDdvrxyP6", "outputId": "a9d8fb53-9985-48df-b91c-ee1553e8da59" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Using mutual information to estimate how much each variable tells us about the price\n", "\n", "from sklearn.feature_selection import mutual_info_regression \n", "import matplotlib.pyplot as plt\n", "\n", "X, y = car_train.drop(['Price', 'Name'], axis=1), car_train['Price'] # features: every column except the price and the name; target: the price\n", "\n", "importances = mutual_info_regression(X, y) # estimate the mutual information between each feature and the Price target\n", "feat_importances = pd.Series(importances, index=X.columns) # store the scores in a Series indexed by the feature names\n", "feat_importances.plot(kind='barh') # visualize the scores as a horizontal bar chart\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "id": "ibOYVh-WxyP7", "outputId": "bacb450e-4331-450b-fedb-b7a6aab71b25" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Using a tree-based regression model (ExtraTreesRegressor) to estimate feature importances\n", "\n", "from sklearn.ensemble import ExtraTreesRegressor\n", "\n", "tree = ExtraTreesRegressor(n_estimators=50)\n", "tree = tree.fit(X, y)\n", "\n", "feat_importances = pd.Series(tree.feature_importances_, index=X.columns)\n", "feat_importances.plot(kind='barh')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "BAeDUR4hxyP8" }, "source": [ "### Data normalization" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "id": "H7Dz5qTwxyP9", "outputId": "75f7b1cc-eb47-4037-d4c3-fcd78e78c5ad" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name Location Year Kilometers_Driven \\\n", "0 Maruti Wagon R LXI CNG 9 2010 0.011051 \n", "1 Hyundai Creta 1.6 CRDi SX Option 10 2015 0.006282 \n", "2 Honda Jazz V 2 2011 0.007051 \n", "3 Maruti Ertiga VDI 2 2012 0.013359 \n", "4 Audi A4 New 2.0 TDI Multitronic 3 2013 0.006231 \n", "... ... ... ... ... \n", "6014 Maruti Swift VDI 4 2014 0.004184 \n", "6015 Hyundai Xcent 1.1 CRDi S 6 2015 0.015359 \n", "6016 Mahindra Xylo D4 BSIV 6 2012 0.008435 \n", "6017 Maruti Wagon R VXI 8 2013 0.007051 \n", "6018 Chevrolet Beat Diesel 5 2011 0.007205 \n", "\n", " Fuel_Type Transmission Owner_Type Mileage Engine Power \\\n", "0 0 1 1 0.793083 0.069594 0.045569 \n", "1 1 1 1 0.586464 0.178266 0.174971 \n", "2 3 1 1 0.542636 0.106997 0.103652 \n", "3 1 1 1 0.619261 0.116115 0.103766 \n", "4 1 0 2 0.453190 0.250093 0.202739 \n", "... ... ... ... ... ... ... \n", "6014 1 1 1 0.846750 0.116115 0.075694 \n", "6015 1 1 1 0.727490 0.092296 0.069989 \n", "6016 1 1 2 0.417412 0.348716 0.147965 \n", "6017 3 1 1 0.563506 0.069594 0.062571 \n", "6018 1 1 1 0.758497 0.058057 0.044504 \n", "\n", " Seats Price \n", "0 5.0 1.75 \n", "1 5.0 12.50 \n", "2 5.0 4.50 \n", "3 7.0 6.00 \n", "4 5.0 17.74 \n", "... ... ... 
\n", "6014 5.0 4.75 \n", "6015 5.0 4.00 \n", "6016 8.0 2.90 \n", "6017 5.0 2.65 \n", "6018 5.0 2.50 \n", "\n", "[5872 rows x 12 columns]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Help the model compare the magnitudes of the features: MinMaxScaler rescales the listed columns to the range 0 to 1\n", "\n", "from sklearn.preprocessing import MinMaxScaler\n", "\n", "scaler = MinMaxScaler()\n", "\n", "car_train[['Power', 'Engine', 'Mileage', 'Kilometers_Driven']] = scaler.fit_transform(car_train[['Power', 'Engine', 'Mileage', 'Kilometers_Driven']])\n", "\n", "car_train\n", "\n", "# Some variables are not worth normalizing: the qualitative (categorical) ones" ] }, { "cell_type": "markdown", "metadata": { "id": "Z1-ARAkAxyP9" }, "source": [ "## **Part 2 - Modeling**" ] }, { "cell_type": "markdown", "metadata": { "id": "0GVYpd78xyP9" }, "source": [ "### Train and test\n" ] }, { "cell_type": "markdown", "metadata": { "id": "XvFtilbu1Sdt" }, "source": [ "\n", "The first step in building any machine learning model is to split the data into 'training', 'test', and 'validation' sets. The validation set is optional, but very important if you plan to deploy the model in real life.\n", "\n", "But why is validation important?\n", "\n", "The 'train' set is used for training. The 'test' set is used to run predictions, and those predictions guide the hyperparameter tuning, after which the model is retrained for better accuracy. As a result, repeated tuning can bias the model toward predicting well only on the test set and not on data in general. A separate validation set that is never used for tuning exposes that bias." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "id": "IalLFvHFxyP-", "outputId": "effd50ca-deb9-4338-8182-ac3d21a7232a" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PowerEngineTransmissionYearMileage
26920.2963100.255117020110.676506
700.8858880.778191020080.253429
59580.1224800.162635120150.769231
17980.0905290.106624120160.655933
34150.0756940.116115120160.751342
..................
50590.1032710.116115120170.724508
33510.1061240.143655120140.685748
16990.2069230.348902120070.324985
26830.0624570.069594120170.688730
28100.0624570.069594120160.611509
\n", "

4110 rows × 5 columns

\n", "
" ], "text/plain": [ " Power Engine Transmission Year Mileage\n", "2692 0.296310 0.255117 0 2011 0.676506\n", "70 0.885888 0.778191 0 2008 0.253429\n", "5958 0.122480 0.162635 1 2015 0.769231\n", "1798 0.090529 0.106624 1 2016 0.655933\n", "3415 0.075694 0.116115 1 2016 0.751342\n", "... ... ... ... ... ...\n", "5059 0.103271 0.116115 1 2017 0.724508\n", "3351 0.106124 0.143655 1 2014 0.685748\n", "1699 0.206923 0.348902 1 2007 0.324985\n", "2683 0.062457 0.069594 1 2017 0.688730\n", "2810 0.062457 0.069594 1 2016 0.611509\n", "\n", "[4110 rows x 5 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "2692 10.30\n", "70 14.50\n", "5958 4.68\n", "1798 5.10\n", "3415 5.63\n", " ... \n", "5059 9.05\n", "3351 4.50\n", "1699 3.00\n", "2683 4.99\n", "2810 3.60\n", "Name: Price, Length: 4110, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.model_selection import train_test_split\n", "# First we train the model on the training split.\n", "# Then we show it only the features of unseen rows and let it predict the target, so the predictions can be compared with the real values.\n", "# Validation: evaluate the model on data it was not trained on, to detect overfitting (does very well only on the training data) and underfitting (does poorly even on the training data).\n", "\n", "\n", "# Define features and target\n", "X, y = car_train[['Power', 'Engine', 'Transmission', 'Year', 'Mileage']], car_train['Price']\n", "# X, y = car_train[['Power','Transmission']], car_train['Price']\n", "# X, y = car_train[['Year','Kilometers_Driven']], car_train['Price']\n", "\n", "# train_test_split shuffles the dataset and cuts off the training set (here 70%; the fraction is configurable).\n", "# Do not use a random split on time series, because the data follows a trend over time.\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.7)\n", "\n", "\n", "display(X_train,y_train)" ] }, { "cell_type": "markdown", "metadata": { "id": "TssxaE4HxyP_" }, 
"source": [ "### Model choice - Our model: boosted regression" ] }, { "cell_type": "markdown", "metadata": { "id": "TO8V81Xf1ODO" }, "source": [ "\n", "**Machine Learning Algorithms Cheat Sheet**\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "id": "NpAOLVn8xyP_" }, "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ovMcgsm2xyQA" }, "source": [ "For this exercise I will use a Boosted Decision Tree regression, because it is a somewhat more complex model that still trains quickly. The algorithm works as follows: it combines many decision trees, training them sequentially so that each new tree corrects the errors of the previous ones.\n", "\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "id": "ZBVoGW9zxyQA" }, "outputs": [], "source": [ "from sklearn.ensemble import GradientBoostingRegressor\n", "\n", "# n_estimators was raised to 300, which increases the number of boosting iterations\n", "reg = GradientBoostingRegressor(n_estimators = 300, learning_rate = 0.5, max_depth=100)\n", "# fit trains the model on the given training data.\n", "# During training the model adjusts its internal parameters to minimize the difference\n", "# between its predictions and the actual values observed in the training data.\n", "reg = reg.fit(X_train, y_train)\n", "\n", "y_pred = reg.predict(X_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lGVMc2pLxyQB", "outputId": "3c5ff076-e74a-41a6-a5db-00551d0344bc" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
Price
0.851.050
10.9415.250
3.002.250
16.0017.250
4.953.000
......
3.402.745
30.3722.660
10.469.872
3.002.000
32.9031.150
\n", "

1762 rows × 1 columns

\n", "
" ], "text/plain": [ " 0\n", "Price \n", "0.85 1.050\n", "10.94 15.250\n", "3.00 2.250\n", "16.00 17.250\n", "4.95 3.000\n", "... ...\n", "3.40 2.745\n", "30.37 22.660\n", "10.46 9.872\n", "3.00 2.000\n", "32.90 31.150\n", "\n", "[1762 rows x 1 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(y_pred, y_test) # index (left) is the actual price / column 0 (right) is our prediction" ] }, { "cell_type": "markdown", "metadata": { "id": "ArAjPWHXxyQB" }, "source": [ "### Validation\n" ] }, { "cell_type": "markdown", "metadata": { "id": "YZVcTeNZ04NC" }, "source": [ "\n", "With the model trained, we now need to check and validate how accurate its predictions are.\n", "\n", "Since this is a REGRESSION problem, we must use validation metrics suited to REGRESSION:\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9RujeVFPxyQC", "outputId": "3c5df826-66a7-4289-d03d-da2e2d72e63c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean absolute error (MAE) = 1.93\n", "Mean absolute percentage error (MAPE) = 20.05%\n", "Mean squared error (MSE) = 23.34\n", "Root mean squared error (RMSE) = 4.83\n" ] } ], "source": [ "import sklearn.metrics as sm\n", "\n", "# Simple metrics\n", "\n", "print(\"Mean absolute error (MAE) =\", round(sm.mean_absolute_error(y_test, y_pred), 2)) # mean absolute error\n", "print(f\"Mean absolute percentage error (MAPE) = {round(sm.mean_absolute_percentage_error(y_test, y_pred), 4): 0.2%}\") # mean absolute error as a percentage\n", "print(\"Mean squared error (MSE) =\", round(sm.mean_squared_error(y_test, y_pred), 2)) # mean squared error\n", "# RMSE: square root of the mean of the squared differences between the observed (actual)\n", "# values and the values predicted by the model; the lower, the better.\n", "# np.sqrt is used because the squared=False argument is deprecated since scikit-learn 1.4.\n", "print(f'Root mean squared error (RMSE) = {round(np.sqrt(sm.mean_squared_error(y_test, y_pred)), 2)}')" ] }, { "cell_type": "markdown", "metadata": { "id": "A_R-RrSyxyQC" }, "source": [ "To check whether our model is overfitting, we will use K-Fold cross-validation to analyze its performance across several train/test splits (plain K-Fold cannot be applied to time series):\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "id": "_mhr_mhXxyQD", "outputId": "0816b1b3-0b9c-4700-ba1c-b6fd99b131cd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "K-Fold cross-validation RMSE: 4.675 +/- 0.425\n" ] } ], "source": [ "from sklearn.model_selection import cross_val_score\n", "\n", "# K-Fold validation\n", "\n", "# cv=5: number of folds, the data is split into 5 parts\n", "# n_jobs=1: the computation runs on a single core\n", "# scoring: negated RMSE (root mean squared error), so the sign is flipped back when printing\n", "scores = cross_val_score(reg, X=X_train, y=y_train, cv=5, n_jobs=1, scoring='neg_root_mean_squared_error')\n", "\n", "print('K-Fold cross-validation RMSE: %.3f +/- %.3f' % (-np.mean(scores), np.std(scores)))\n", "\n", "# Mean RMSE +/- standard deviation\n", "# Result: the mean RMSE across the 5 cross-validation folds is approximately 4.675.\n", "# The standard deviation of the RMSE is 0.425.\n", "# A standard deviation of 0.425 is relatively small compared with the mean of 4.675,\n", "# suggesting that the model is consistent across the cross-validation folds.\n", "# That is a good sign: the model does not vary much from fold to fold.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "j1ZWihiXxyQE" }, "source": [ "### Ensemble" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " A machine learning technique that combines the predictions of multiple individual models to improve the accuracy and robustness of the predictions. The idea behind ensemble methods is that combining several models can capture a wider range of patterns and variations in the data, reducing the risk of overfitting and improving overall performance.\n", "\n", " 1- Bagging (Bootstrap Aggregating):\n", " \n", " a) Random Forest: One of the best-known examples of bagging, in which multiple decision trees are trained on different subsets of the training data (obtained by bootstrap) and their predictions are combined (by voting for classification or by averaging for regression).\n", " \n", " 2- Boosting:\n", "\n", " a) Gradient Boosting Machines (GBM): Models are trained sequentially, with each new model correcting the errors of the previous ones. Popular examples include XGBoost, LightGBM, and CatBoost.\n", "\n", " b) AdaBoost: A method that adjusts the weight of the observations based on the error of the previous models, focusing more on hard observations.\n", " \n", " 3- Stacking:\n", " \n", " a) Individual models (of different types or with different parameters) are trained and their predictions are used as inputs to a final model (meta-model), which learns the best combination of those predictions.\n", " \n", " 4- Voting: For classification problems, the predictions of several models are combined by majority vote (simple voting) or weighted vote (weighted voting) to decide the final class." ] }, { "cell_type": "markdown", "metadata": { "id": "wV3zt5Sc0yxf" }, "source": [ "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rS3tvX6AxyQF", "outputId": "710e0fa6-77ba-453c-e169-110cff68fd7d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "KNeighbors 4.629 (0.175)\n", "DecisionTree 4.709 (0.559)\n", "XGBoost 3.758 (0.385)\n", "GradientBoosting 4.691 (0.408)\n", "Stacking 3.742 (0.357)\n" ] } ], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.neighbors import KNeighborsRegressor\n", "from sklearn.linear_model import 
LinearRegression\n", "from sklearn.ensemble import StackingRegressor\n", "import xgboost as xgb\n", "\n", "# stacking ensemble of the models\n", "def get_stacking():\n", "\t# base models\n", "\tlevel0 = list()\n", "\tlevel0.append(('KNeighbors', KNeighborsRegressor()))\n", "\tlevel0.append(('XGBoost', xgb.XGBRegressor()))\n", "\tlevel0.append(('Decision', DecisionTreeRegressor()))\n", "\tlevel0.append(('GradientBoosting', GradientBoostingRegressor()))\n", "\t# meta-learner model\n", "\tlevel1 = LinearRegression()\n", "\tmodel = StackingRegressor(estimators=level0, final_estimator=level1, cv=5)\n", "\treturn model\n", "\n", "# models to evaluate\n", "def get_models():\n", "\tmodels = dict()\n", "\tmodels['KNeighbors'] = KNeighborsRegressor()\n", "\tmodels['DecisionTree'] = DecisionTreeRegressor()\n", "\tmodels['XGBoost'] = xgb.XGBRegressor()\n", "\tmodels['GradientBoosting'] = GradientBoostingRegressor(n_estimators = 300, learning_rate = 0.5, max_depth=100)\n", "\tmodels['Stacking'] = get_stacking()\n", "\treturn models\n", "\n", "# model evaluation by cross-validation\n", "def evaluate_model(model, X, y):\n", "\tscores = cross_val_score(model, X, y, scoring='neg_root_mean_squared_error', cv=5, n_jobs=-1, error_score='raise')\n", "\treturn scores\n", "\n", "models = get_models()\n", "\n", "results, names = list(), list()\n", "for name, model in models.items():\n", "\tscores = evaluate_model(model, X_train, y_train)\n", "\tresults.append(scores)\n", "\tnames.append(name)\n", "\tprint('%s %.3f (%.3f)' % (name, -np.mean(scores), np.std(scores)))\n", "\n", "# The cross-validation results show how each model performs on the regression task.\n", "# Evaluation metric: RMSE, printed as mean (standard deviation) across the 5 folds" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "Interpretation\n", "\n", "1- XGBoost has the lowest mean RMSE among the individual models (3.758), indicating good predictive performance.\n", "\n", "2- The Stacking model has the lowest mean RMSE overall (3.742), slightly below XGBoost, suggesting that combining the base models helps improve the accuracy of the predictions.\n", "\n", "3- The KNeighbors, DecisionTree, and GradientBoosting models have higher mean RMSEs (4.629 to 4.709), indicating weaker predictive performance than XGBoost and Stacking.\n", "\n", "4- The standard deviation is relatively low for all models (0.175 to 0.559), which suggests reasonably consistent predictions across the cross-validation folds." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GNMnlGTtxyQG" }, "outputs": [], "source": [ "# Fit the stacking ensemble explicitly instead of relying on the last loop variable\n", "model = models['Stacking'].fit(X_train, y_train)\n", "\n", "y_pred = model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7U21wlrOxyQH", "outputId": "b9485758-b25f-4457-f15b-ba3774282111" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean absolute error (MAE) = 1.7\n", "Mean absolute percentage error (MAPE) = 17.45%\n", "Mean squared error (MSE) = 18.92\n", "Root mean squared error (RMSE) = 4.35\n" ] } ], "source": [ "# Simple metrics\n", "\n", "print(\"Mean absolute error (MAE) =\", round(sm.mean_absolute_error(y_test, y_pred), 2))\n", "print(f\"Mean absolute percentage error (MAPE) = {round(sm.mean_absolute_percentage_error(y_test, y_pred), 4): 0.2%}\")\n", "print(\"Mean squared error (MSE) =\", round(sm.mean_squared_error(y_test, y_pred), 2))\n", "print(f'Root mean squared error (RMSE) = {round(np.sqrt(sm.mean_squared_error(y_test, y_pred)), 2)}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The stacking model makes smaller errors than the previous model (RMSE 4.35 vs 4.83, MAE 1.7 vs 1.93)" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "base", "language": "python", "name": 
"python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "2eaf2605f86d6c7bf8701796461e0310715783929ef3cb9e7757643455c26c87" } } }, "nbformat": 4, "nbformat_minor": 0 }